Exploratory data analysis (EDA)

Exploratory Data Analysis (EDA) plays a very important role in understanding a dataset. Whether you are going to build a machine learning model or simply want to draw insights from the given data, EDA is the first task to perform. The effort it takes, however, grows with the number of columns in your dataset.

This is a generic exploratory data analysis notebook which will serve as a guideline in your future data exploration endeavours. You can always build on this and add more analysis/graphs according to your dataset/requirements.

  • Basic Data Summary
  • Missing Values Analysis
  • Data Distribution Analysis
  • Correlation Analysis
  • Visualization of high-dimensional data

Prerequisites

As prerequisites, we import the necessary libraries and load the files needed for our EDA.

In [50]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import LabelEncoder


# Comment this out if the data visualisations don't work on your side
%matplotlib inline

plt.style.use('bmh')

Load the dataset into a data frame and print the top 5 rows to inspect the attributes.

This dataset has 80 columns and 3865 records, with y (continuous) as the class label.

In [30]:
df = pd.read_csv(r'C:\Users\Saif\Desktop\ProHack Competition\train.csv', na_values=['NaN'])
df.head()
Out[30]:
galactic year galaxy existence expectancy index existence expectancy at birth Gross income per capita Income Index Expected years of education (galactic years) Mean years of education (galactic years) Intergalactic Development Index (IDI) Education Index ... Intergalactic Development Index (IDI), female Intergalactic Development Index (IDI), male Gender Development Index (GDI) Intergalactic Development Index (IDI), female, Rank Intergalactic Development Index (IDI), male, Rank Adjusted net savings Creature Immunodeficiency Disease prevalence, adult (% ages 15-49), total Private galaxy capital flows (% of GGP) Gender Inequality Index (GII) y
0 990025 Large Magellanic Cloud (LMC) 0.628657 63.125200 27109.234310 0.646039 8.240543 NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.052590
1 990025 Camelopardalis B 0.818082 81.004994 30166.793958 0.852246 10.671823 4.742470 0.833624 0.467873 ... NaN NaN NaN NaN NaN 19.177926 NaN 22.785018 NaN 0.059868
2 990025 Virgo I 0.659443 59.570534 8441.707353 0.499762 8.840316 5.583973 0.469110 0.363837 ... NaN NaN NaN NaN NaN 21.151265 6.534020 NaN NaN 0.050449
3 990025 UGC 8651 (DDO 181) 0.555862 52.333293 NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN 5.912194 NaN NaN 0.049394
4 990025 Tucana Dwarf 0.991196 81.802464 81033.956906 1.131163 13.800672 13.188907 0.910341 0.918353 ... NaN NaN NaN NaN NaN NaN 5.611753 NaN NaN 0.154247

5 rows × 80 columns

Data Summary/Statistics

DataFrame.info() shows the number of rows and columns, the non-null count and data type of each column, the number of columns per data type, and the memory occupied by the data frame.

In [11]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3865 entries, 0 to 3864
Data columns (total 80 columns):
galactic year                                                                              3865 non-null int64
galaxy                                                                                     3865 non-null object
existence expectancy index                                                                 3864 non-null float64
existence expectancy at birth                                                              3864 non-null float64
Gross income per capita                                                                    3837 non-null float64
Income Index                                                                               3837 non-null float64
Expected years of education (galactic years)                                               3732 non-null float64
Mean years of education (galactic years)                                                   3502 non-null float64
Intergalactic Development Index (IDI)                                                      3474 non-null float64
Education Index                                                                            3474 non-null float64
Intergalactic Development Index (IDI), Rank                                                3432 non-null float64
Population using at least basic drinking-water services (%)                                2021 non-null float64
Population using at least basic sanitation services (%)                                    2015 non-null float64
Gross capital formation (% of GGP)                                                         1502 non-null float64
Population, total (millions)                                                               1271 non-null float64
Population, urban (%)                                                                      1271 non-null float64
Mortality rate, under-five (per 1,000 live births)                                         1271 non-null float64
Mortality rate, infant (per 1,000 live births)                                             1259 non-null float64
Old age dependency ratio (old age (65 and older) per 100 creatures (ages 15-64))           1264 non-null float64
Population, ages 15–64 (millions)                                                          1264 non-null float64
Population, ages 65 and older (millions)                                                   1264 non-null float64
Life expectancy at birth, male (galactic years)                                            1264 non-null float64
Life expectancy at birth, female (galactic years)                                          1264 non-null float64
Population, under age 5 (millions)                                                         1264 non-null float64
Young age (0-14) dependency ratio (per 100 creatures ages 15-64)                           1264 non-null float64
Adolescent birth rate (births per 1,000 female creatures ages 15-19)                       1252 non-null float64
Total unemployment rate (female to male ratio)                                             1237 non-null float64
Vulnerable employment (% of total employment)                                              1237 non-null float64
Unemployment, total (% of labour force)                                                    1237 non-null float64
Employment in agriculture (% of total employment)                                          1237 non-null float64
Labour force participation rate (% ages 15 and older)                                      1237 non-null float64
Labour force participation rate (% ages 15 and older), female                              1237 non-null float64
Employment in services (% of total employment)                                             1237 non-null float64
Labour force participation rate (% ages 15 and older), male                                1237 non-null float64
Employment to population ratio (% ages 15 and older)                                       1237 non-null float64
Jungle area (% of total land area)                                                         1234 non-null float64
Share of employment in nonagriculture, female (% of total employment in nonagriculture)    1237 non-null float64
Youth unemployment rate (female to male ratio)                                             1236 non-null float64
Unemployment, youth (% ages 15–24)                                                         1236 non-null float64
Mortality rate, female grown up (per 1,000 people)                                         1253 non-null float64
Mortality rate, male grown up (per 1,000 people)                                           1253 non-null float64
Infants lacking immunization, red hot disease (% of one-galactic year-olds)                1219 non-null float64
Infants lacking immunization, Combination Vaccine (% of one-galactic year-olds)            1219 non-null float64
Gross galactic product (GGP) per capita                                                    1202 non-null float64
Gross galactic product (GGP), total                                                        1202 non-null float64
Outer Galaxies direct investment, net inflows (% of GGP)                                   1169 non-null float64
Exports and imports (% of GGP)                                                             1144 non-null float64
Share of seats in senate (% held by female)                                                1123 non-null float64
Natural resource depletion                                                                 1132 non-null float64
Mean years of education, female (galactic years)                                           1140 non-null float64
Mean years of education, male (galactic years)                                             1138 non-null float64
Expected years of education, female (galactic years)                                       1109 non-null float64
Expected years of education, male (galactic years)                                         1108 non-null float64
Maternal mortality ratio (deaths per 100,000 live births)                                  1252 non-null float64
Renewable energy consumption (% of total final energy consumption)                         1235 non-null float64
Estimated gross galactic income per capita, male                                           1055 non-null float64
Estimated gross galactic income per capita, female                                         1055 non-null float64
Rural population with access to electricity (%)                                            1029 non-null float64
Domestic credit provided by financial sector (% of GGP)                                    1079 non-null float64
Population with at least some secondary education, female (% ages 25 and older)            1089 non-null float64
Population with at least some secondary education, male (% ages 25 and older)              1087 non-null float64
Gross fixed capital formation (% of GGP)                                                   1074 non-null float64
Remittances, inflows (% of GGP)                                                            1028 non-null float64
Population with at least some secondary education (% ages 25 and older)                    1051 non-null float64
Intergalactic inbound tourists (thousands)                                                 995 non-null float64
Gross enrolment ratio, primary (% of primary under-age population)                         1038 non-null float64
Respiratory disease incidence (per 100,000 people)                                         896 non-null float64
Interstellar phone subscriptions (per 100 people)                                          891 non-null float64
Interstellar Data Net users, total (% of population)                                       872 non-null float64
Current health expenditure (% of GGP)                                                      867 non-null float64
Intergalactic Development Index (IDI), female                                              916 non-null float64
Intergalactic Development Index (IDI), male                                                915 non-null float64
Gender Development Index (GDI)                                                             914 non-null float64
Intergalactic Development Index (IDI), female, Rank                                        893 non-null float64
Intergalactic Development Index (IDI), male, Rank                                          892 non-null float64
Adjusted net savings                                                                       912 non-null float64
Creature Immunodeficiency Disease prevalence, adult (% ages 15-49), total                  941 non-null float64
Private galaxy capital flows (% of GGP)                                                    874 non-null float64
Gender Inequality Index (GII)                                                              844 non-null float64
y                                                                                          3865 non-null float64
dtypes: float64(78), int64(1), object(1)
memory usage: 2.4+ MB

DataFrame.describe() shows summary statistics for the numerical columns of the data frame: the count, mean and standard deviation, plus the five-number summary (minimum, quartiles and maximum) of each variable.

In [12]:
df.describe()
Out[12]:
galactic year existence expectancy index existence expectancy at birth Gross income per capita Income Index Expected years of education (galactic years) Mean years of education (galactic years) Intergalactic Development Index (IDI) Education Index Intergalactic Development Index (IDI), Rank ... Intergalactic Development Index (IDI), female Intergalactic Development Index (IDI), male Gender Development Index (GDI) Intergalactic Development Index (IDI), female, Rank Intergalactic Development Index (IDI), male, Rank Adjusted net savings Creature Immunodeficiency Disease prevalence, adult (% ages 15-49), total Private galaxy capital flows (% of GGP) Gender Inequality Index (GII) y
count 3.865000e+03 3864.000000 3864.000000 3837.000000 3837.000000 3732.000000 3502.000000 3474.000000 3474.000000 3432.000000 ... 916.000000 915.000000 914.000000 893.000000 892.000000 912.000000 941.000000 874.000000 844.000000 3865.000000
mean 1.000709e+06 0.872479 76.798111 31633.240872 0.825154 14.723296 10.283959 0.804246 0.745900 135.129178 ... 0.823561 0.844209 1.008465 121.754797 120.873428 21.252922 6.443023 22.261474 0.600733 0.082773
std 6.945463e+03 0.162367 10.461654 18736.378445 0.194055 3.612546 3.319948 0.176242 0.199795 52.449535 ... 0.185780 0.159041 0.087299 46.269362 46.795666 14.258986 4.804873 34.342797 0.205785 0.063415
min 9.900250e+05 0.227890 34.244062 -126.906522 0.292001 3.799663 1.928166 0.273684 0.189874 9.925906 ... 0.305733 0.369519 0.465177 23.224603 16.215151 -76.741414 -1.192011 -735.186886 0.089092 0.013036
25% 9.950060e+05 0.763027 69.961449 20169.118912 0.677131 12.592467 7.654169 0.671862 0.597746 92.262724 ... 0.690707 0.731264 0.965800 84.090816 82.232550 15.001028 4.113472 17.227899 0.430332 0.047889
50% 1.000000e+06 0.907359 78.995101 26600.768195 0.827300 14.942913 10.385465 0.824758 0.761255 135.914318 ... 0.835410 0.862773 1.029947 120.069916 121.057923 22.182571 5.309497 24.472557 0.624640 0.057820
75% 1.006009e+06 0.992760 84.558971 36898.631754 0.970295 17.123797 12.884752 0.939043 0.893505 175.301993 ... 0.970365 0.961369 1.068481 158.579644 157.815625 29.134738 6.814577 31.748295 0.767404 0.087389
max 1.015056e+06 1.246908 100.210053 151072.683156 1.361883 26.955944 19.057648 1.232814 1.269625 278.786613 ... 1.237661 1.182746 1.181230 232.720847 233.915373 61.903641 36.538462 95.941245 1.098439 0.683813

8 rows × 79 columns

In [13]:
# transpose the summary so that each row describes one variable
df.describe().T
Out[13]:
count mean std min 25% 50% 75% max
galactic year 3865.0 1.000709e+06 6945.463143 990025.000000 995006.000000 1000000.000000 1.006009e+06 1.015056e+06
existence expectancy index 3864.0 8.724787e-01 0.162367 0.227890 0.763027 0.907359 9.927599e-01 1.246908e+00
existence expectancy at birth 3864.0 7.679811e+01 10.461654 34.244062 69.961449 78.995101 8.455897e+01 1.002101e+02
Gross income per capita 3837.0 3.163324e+04 18736.378445 -126.906522 20169.118912 26600.768195 3.689863e+04 1.510727e+05
Income Index 3837.0 8.251535e-01 0.194055 0.292001 0.677131 0.827300 9.702946e-01 1.361883e+00
... ... ... ... ... ... ... ... ...
Adjusted net savings 912.0 2.125292e+01 14.258986 -76.741414 15.001028 22.182571 2.913474e+01 6.190364e+01
Creature Immunodeficiency Disease prevalence, adult (% ages 15-49), total 941.0 6.443023e+00 4.804873 -1.192011 4.113472 5.309497 6.814577e+00 3.653846e+01
Private galaxy capital flows (% of GGP) 874.0 2.226147e+01 34.342797 -735.186886 17.227899 24.472557 3.174829e+01 9.594124e+01
Gender Inequality Index (GII) 844.0 6.007333e-01 0.205785 0.089092 0.430332 0.624640 7.674039e-01 1.098439e+00
y 3865.0 8.277313e-02 0.063415 0.013036 0.047889 0.057820 8.738930e-02 6.838127e-01

79 rows × 8 columns
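
The summary above shows negative minima for some variables, for example Gross income per capita (-126.9) and Creature Immunodeficiency Disease prevalence (-1.19). Whether such values are errors depends on the data source, so the following is only a minimal sketch of how one might list the affected columns. The helper name columns_with_negative_values and the toy frame are made up for illustration; in the notebook you would pass df instead.

```python
import numpy as np
import pandas as pd

def columns_with_negative_values(frame):
    """Count values below zero per numeric column; keep only columns that have any."""
    numeric = frame.select_dtypes(include=[np.number])
    negative_counts = (numeric < 0).sum()  # NaN compares as False, so it is not counted
    return negative_counts[negative_counts > 0].sort_values(ascending=False)

# Toy frame standing in for df (values are made up)
toy = pd.DataFrame({
    'income': [100.0, -5.0, 250.0, np.nan],
    'index': [0.5, 0.7, 0.9, 1.1],
    'flows': [-1.0, -2.0, 3.0, 4.0],
    'name': ['a', 'b', 'c', 'd'],
})
negatives = columns_with_negative_values(toy)
print(negatives)
```

Columns flagged this way are candidates for a closer look, not automatically for removal.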

DataFrame.describe() for object (categorical) variables in the data frame

In [14]:
df.describe(include=['object']).T
Out[14]:
count unique top freq
galaxy 3865 181 Tucana Dwarf 26

Missing Value Analysis

In [15]:
# columns_with_missing_values holds each column name along with its percentage of missing values
# (computed without chained assignment, which is what triggers SettingWithCopyWarning)
missing_pct = df.isnull().mean() * 100

# keep only the columns that actually have missing values, sorted in ascending order
columns_with_missing_values = (
    missing_pct[missing_pct > 0]
    .sort_values(ascending=True)
    .rename_axis('columns_name')
    .reset_index(name='missing_value')
)

# plotting column name along with missing percentage
plt.figure(figsize=(18, 8))
plt.barh(columns_with_missing_values['columns_name'], columns_with_missing_values['missing_value'])
Out[15]:
<BarContainer object of 77 artists>

Getting rid of features with more than 75% missing values (this dataset has no ID column to remove)

In [31]:
# DataFrame.count() does not include NaN values

# keep only the columns in which at least 25% of the values are present
df2 = df[[column for column in df if df[column].count() / len(df) >= 0.25]]

# This dataset has no ID column; for a dataset that has one, delete it here
# del df2['Id']

# Printing list of dropped columns
print("List of dropped columns:", end=" ")
for c in df.columns:
    if c not in df2.columns:
        print(c, end=", ")
print('\n')
df = df2
List of dropped columns: Respiratory disease incidence (per 100,000 people), Interstellar phone subscriptions (per 100 people), Interstellar Data Net users, total (% of population), Current health expenditure (% of GGP), Intergalactic Development Index (IDI), female, Intergalactic Development Index (IDI), male, Gender Development Index (GDI), Intergalactic Development Index (IDI), female, Rank, Intergalactic Development Index (IDI), male, Rank, Adjusted net savings , Creature Immunodeficiency Disease prevalence, adult (% ages 15-49), total, Private galaxy capital flows (% of GGP), Gender Inequality Index (GII), 
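
The filter above keeps a column when at least 25% of its values are present. As a hedged alternative, the same rule can be expressed with DataFrame.dropna and its thresh argument, which is the minimum number of non-missing values a column needs in order to be kept. The function name drop_sparse_columns, the min_present parameter and the toy frame are assumptions made for this sketch.

```python
import numpy as np
import pandas as pd

def drop_sparse_columns(frame, min_present=0.25):
    """Keep columns whose share of non-missing values is at least min_present."""
    # thresh expects an absolute count, so convert the fraction to a row count
    min_count = int(np.ceil(min_present * len(frame)))
    kept = frame.dropna(axis=1, thresh=min_count)
    dropped = [c for c in frame.columns if c not in kept.columns]
    return kept, dropped

# Toy frame standing in for df (values are made up)
toy = pd.DataFrame({
    'dense': [1.0, 2.0, 3.0, 4.0],
    'quarter': [1.0, np.nan, np.nan, np.nan],  # 25% present: kept
    'empty': [np.nan] * 4,                     # 0% present: dropped
})
kept, dropped = drop_sparse_columns(toy)
print("Dropped columns:", dropped)
```

Returning the dropped names as a list also avoids the ambiguity of a comma-separated printout when the column names themselves contain commas.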

Data Distribution

Distribution of class label

In [32]:
print(df['y'].describe())
plt.figure(figsize=(9, 8))
# distplot is deprecated since seaborn 0.11; histplot with kde=True is its replacement
sns.histplot(df['y'], color='b', bins=50, kde=True, stat='density', alpha=0.6);
count    3865.000000
mean        0.082773
std         0.063415
min         0.013036
25%         0.047889
50%         0.057820
75%         0.087389
max         0.683813
Name: y, dtype: float64
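
The statistics above suggest a right-skewed label: the mean (0.083) is well above the median (0.058), and the maximum (0.684) lies far beyond the 75% quartile (0.087). A minimal sketch of how one might quantify this and check whether a log transform reduces the skew is shown below. It uses made-up log-normal data as a stand-in for df['y']; the log transform is a suggestion on my side, not something the notebook applies, and it only works here because all values are positive.

```python
import numpy as np
import pandas as pd

# Made-up, strictly positive, right-skewed data standing in for df['y']
rng = np.random.default_rng(0)
y_toy = pd.Series(rng.lognormal(mean=-2.7, sigma=0.6, size=2000), name='y')

skew_raw = y_toy.skew()          # sample skewness of the raw values
skew_log = np.log(y_toy).skew()  # sample skewness after a log transform

print(f"skewness raw: {skew_raw:.2f}, after log: {skew_log:.2f}")
```

A skewness near zero after the transform indicates a roughly symmetric distribution, which some models handle more gracefully than a long right tail.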

Distributions of other numerical variables

In [33]:
list(set(df.dtypes.tolist()))
Out[33]:
[dtype('int64'), dtype('float64'), dtype('O')]
In [34]:
df_num = df.select_dtypes(include = ['float64', 'int64'])
df_num.head()
Out[34]:
galactic year existence expectancy index existence expectancy at birth Gross income per capita Income Index Expected years of education (galactic years) Mean years of education (galactic years) Intergalactic Development Index (IDI) Education Index Intergalactic Development Index (IDI), Rank ... Rural population with access to electricity (%) Domestic credit provided by financial sector (% of GGP) Population with at least some secondary education, female (% ages 25 and older) Population with at least some secondary education, male (% ages 25 and older) Gross fixed capital formation (% of GGP) Remittances, inflows (% of GGP) Population with at least some secondary education (% ages 25 and older) Intergalactic inbound tourists (thousands) Gross enrolment ratio, primary (% of primary under-age population) y
0 990025 0.628657 63.125200 27109.234310 0.646039 8.240543 NaN NaN NaN NaN ... NaN 75.604799 NaN NaN 42.616284 NaN NaN NaN NaN 0.052590
1 990025 0.818082 81.004994 30166.793958 0.852246 10.671823 4.742470 0.833624 0.467873 152.522198 ... NaN 57.214150 57.314932 56.187355 29.908422 6.225946 44.780023 NaN 120.886080 0.059868
2 990025 0.659443 59.570534 8441.707353 0.499762 8.840316 5.583973 0.469110 0.363837 209.813266 ... NaN 76.141735 42.405827 53.927715 18.732049 4.138115 24.030945 NaN 96.626831 0.050449
3 990025 0.555862 52.333293 NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.049394
4 990025 0.991196 81.802464 81033.956906 1.131163 13.800672 13.188907 0.910341 0.918353 71.885345 ... 134.967049 NaN 77.223935 75.475076 31.398393 NaN 66.674651 NaN NaN 0.154247

5 rows × 66 columns

In [41]:
df_num.hist(figsize=(60, 80), bins=50, xlabelsize=10, ylabelsize=10); 

Correlation Analysis

We compute the correlation of each numerical attribute with the class label y and display the results in descending order.

In [36]:
df_num_corr = df_num.corr()['y'][:-1]
golden_features_list = df_num_corr.sort_values(ascending=False)
print(golden_features_list)
Old age dependency ratio (old age (65 and older) per 100 creatures (ages 15-64))    0.679981
Estimated gross galactic income per capita, female                                  0.667465
Intergalactic Development Index (IDI)                                               0.625114
Education Index                                                                     0.613938
Expected years of education (galactic years)                                        0.584069
                                                                                      ...   
Employment in agriculture (% of total employment)                                  -0.473959
Adolescent birth rate (births per 1,000 female creatures ages 15-19)               -0.491689
Vulnerable employment (% of total employment)                                      -0.496568
Young age (0-14) dependency ratio (per 100 creatures ages 15-64)                   -0.533741
Intergalactic Development Index (IDI), Rank                                        -0.681592
Name: y, Length: 65, dtype: float64
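
Sorting by the signed value places strongly negative correlations at the bottom of the list, although Intergalactic Development Index (IDI), Rank (-0.68) is as informative as the top entry (0.68). Note also that DataFrame.corr() works on pairwise-complete observations, so columns with many missing values are correlated on fewer rows. The sketch below ranks features by absolute correlation and requires a minimum number of shared rows. The function name rank_by_target_correlation, the min_periods value of 30 and the toy data are assumptions for illustration.

```python
import numpy as np
import pandas as pd

def rank_by_target_correlation(frame, target='y', method='pearson', min_periods=30):
    """Rank numeric features by the absolute value of their correlation with the target."""
    numeric = frame.select_dtypes(include=[np.number])
    corr = numeric.corr(method=method, min_periods=min_periods)[target].drop(target)
    corr = corr.dropna()  # pairs with fewer than min_periods shared rows come back as NaN
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)  # signed values, ordered by magnitude

# Toy data standing in for df_num (values are made up)
rng = np.random.default_rng(0)
y = rng.normal(size=500)
toy = pd.DataFrame({
    'x_pos': y + rng.normal(size=500),          # moderate positive correlation
    'x_neg': -y + 0.1 * rng.normal(size=500),   # strong negative correlation
    'x_noise': rng.normal(size=500),            # unrelated to y
    'y': y,
})
ranking = rank_by_target_correlation(toy)
print(ranking)
```

Passing method='spearman' gives a rank-based alternative that is less sensitive to outliers and to monotonic but non-linear relationships.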

Correlation Heat Map

In [37]:
corr = df_num.drop('y', axis=1).corr() # We already examined class label y correlations
plt.figure(figsize=(15, 15))

sns.heatmap(corr, vmax=1.0, vmin=-1.0, linewidths=0.1,
            annot=True, annot_kws={"size": 4}, square=True);
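
With this many features the annotated heat map is hard to read, so it can help to list the most strongly correlated feature pairs explicitly. The following is a minimal sketch under stated assumptions: the helper name top_correlated_pairs and the 0.9 threshold are hypothetical, and the toy frame stands in for df_num.drop('y', axis=1).

```python
import numpy as np
import pandas as pd

def top_correlated_pairs(corr, threshold=0.9):
    """Return feature pairs whose absolute correlation is at least threshold."""
    # Keep only the upper triangle (k=1 skips the diagonal) so each pair appears once
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack().dropna()
    pairs = pairs[pairs.abs() >= threshold]
    return pairs.reindex(pairs.abs().sort_values(ascending=False).index)

# Toy data (values are made up): b is almost a copy of a, c is unrelated
rng = np.random.default_rng(0)
a = rng.normal(size=300)
toy = pd.DataFrame({
    'a': a,
    'b': a + 0.05 * rng.normal(size=300),
    'c': rng.normal(size=300),
})
pairs = top_correlated_pairs(toy.corr())
print(pairs)
```

Pairs reported this way carry largely redundant information, which is worth knowing before feeding both features into a model.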

Data Distributions

Feature distributions with respect to class label

In [40]:
for i in range(0, len(df_num.columns), 3):
    sns.pairplot(data=df_num,
                x_vars=df_num.columns[i:i+3],
                y_vars=['y'], height=5)  # 'size' was renamed to 'height' in seaborn 0.9
c:\users\saif\anaconda3\envs\prohack\lib\site-packages\matplotlib\pyplot.py:514: RuntimeWarning: More than 20 figures have been opened. Figures created through the pyplot interface (`matplotlib.pyplot.figure`) are retained until explicitly closed and may consume too much memory. (To control this warning, see the rcParam `figure.max_open_warning`).
  max_open_warning, RuntimeWarning)

Feature relationships with class label

In [44]:
features_to_analyse = list(df_num)
fig, ax = plt.subplots(round(len(features_to_analyse) / 3), 3, figsize = (18, 72))

for i, ax in enumerate(fig.axes):
    if i < len(features_to_analyse) - 1:
        sns.regplot(x=features_to_analyse[i],y='y', data=df[features_to_analyse], ax=ax)

Categorical Variables

In [45]:
# select the categorical (object) columns and add the class label y alongside them
df_cat = df.select_dtypes(include = ['O']).copy()  # copy to avoid SettingWithCopyWarning
df_cat['y'] = df['y']
categorical_features = list(df_cat)
df_cat.head()
Out[45]:
galaxy y
0 Large Magellanic Cloud (LMC) 0.052590
1 Camelopardalis B 0.059868
2 Virgo I 0.050449
3 UGC 8651 (DDO 181) 0.049394
4 Tucana Dwarf 0.154247

Box Plots for categorical features against class label

In [46]:
plt.figure(figsize = (10, 6))
ax = sns.boxplot(x='galaxy', y='y', data=df_cat)
plt.setp(ax.artists, alpha=.5, linewidth=2, edgecolor="k")
plt.xticks(rotation=45);  # the trailing semicolon suppresses the printed tick array

High-Dimensional Data Analysis

Most real-world datasets have more than one feature, and each feature can be considered a dimension in the space of data points. Consequently, more often than not, we deal with high-dimensional datasets, which are hard to visualize in their entirety.

To visualize the dataset as a whole, we need to decrease the number of dimensions used in the visualization without losing much information about the data. This task is called dimensionality reduction and is an example of an unsupervised learning problem, because we need to derive new, low-dimensional features from the data itself, without any supervised input.

One of the best-known dimensionality reduction methods is Principal Component Analysis (PCA), which was covered in the previous lectures. Its limitation is that it is a linear algorithm, which imposes certain restrictions on the data.
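
For reference, a minimal PCA sketch with scikit-learn is shown below. It is not part of the original analysis; the toy matrix is made up so that it has an essentially two-dimensional linear structure, which is the situation in which PCA works well.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Made-up data: 10 observed features generated linearly from 2 latent factors plus noise
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 10))
X_toy = latent @ mixing + 0.05 * rng.normal(size=(200, 10))

# PCA is sensitive to feature scale, so standardize first
X_toy_scaled = StandardScaler().fit_transform(X_toy)

pca = PCA(n_components=2)
pca_repr = pca.fit_transform(X_toy_scaled)

# Share of the total variance captured by the two components
explained = pca.explained_variance_ratio_.sum()
print(pca_repr.shape, round(explained, 3))
```

When the structure in the data is not linear, the explained variance of the first two components tends to be much lower, and a 2D PCA plot hides a correspondingly larger part of the picture.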

There are also many non-linear methods, collectively called manifold learning. One of the best known is t-SNE.

t-distributed Stochastic Neighbor Embedding (t-SNE)

The basic idea is to find a projection of a high-dimensional feature space onto a plane (or into a 3D space, but it is almost always 2D) such that points that were far apart in the original n-dimensional space end up far apart on the plane, while points that were originally close remain close to each other. Please refer to the t-SNE

In [51]:
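# scaling the dataset; the class label y is excluded so that it does not shape the embedding
# note: missing values are filled with 0 before scaling
scaler = StandardScaler()
X_scaled = scaler.fit_transform(df_num.drop('y', axis=1).fillna(0))

# t-SNE modeling (random_state is fixed so that the embedding is reproducible)
tsne = TSNE(n_components=2, random_state=0)
tsne_repr = tsne.fit_transform(X_scaled)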
In [52]:
# converting into dataframe
tsne_repr = pd.DataFrame(tsne_repr, columns=['dim'+str(i) for i in range(1,3)])

# converting the continuous variable to class labels using two equal-width bins
df['y']=pd.cut(np.array(df['y']),2,labels=["low_well_being", "high_well_being"])

# LabelEncoder sorts the labels alphabetically: high_well_being -> 0, low_well_being -> 1
label_encoder=LabelEncoder()
df['y']=label_encoder.fit_transform(df['y'])
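
pd.cut with an integer number of bins creates bins of equal width. Because y is right-skewed (75% of the values lie below 0.087 while the maximum is 0.684), most records are likely to fall into the lower bin. The sketch below contrasts equal-width binning with quantile binning via pd.qcut on made-up skewed data; qcut is offered here as an alternative to consider, not as what the notebook does.

```python
import numpy as np
import pandas as pd

# Made-up right-skewed data standing in for the continuous label y
rng = np.random.default_rng(0)
y_toy = pd.Series(rng.lognormal(mean=-2.7, sigma=0.6, size=1000))

labels = ['low_well_being', 'high_well_being']
equal_width = pd.cut(y_toy, 2, labels=labels)   # split at the midpoint of the value range
equal_freq = pd.qcut(y_toy, 2, labels=labels)   # split at the median

width_counts = equal_width.value_counts()
freq_counts = equal_freq.value_counts()
print(width_counts)
print(freq_counts)
```

Equal-width bins preserve the meaning of "high" as "close to the maximum", while quantile bins give balanced classes; which one is appropriate depends on what the colouring in the plot is meant to show.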
In [53]:
# plotting it w.r.t. the class variable, i.e. assigning each class label a colour (0 = high well-being in green, 1 = low well-being in red)
plt.figure(figsize=(9, 9))
plt.scatter(np.array(tsne_repr['dim1']), np.array(tsne_repr['dim2']),alpha=0.4,s=70,c=df['y'].map({0: 'green', 1: 'red'}))
Out[53]:
<matplotlib.collections.PathCollection at 0x1d7d62cf9e8>
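
t-SNE embeddings vary with the random seed and with the perplexity parameter, which roughly controls how many neighbours each point takes into account. The sketch below shows where both are set in scikit-learn, using made-up clustered data rather than the notebook's data frame; the chosen values are illustrative assumptions, not tuned settings.

```python
import numpy as np
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

# Made-up data: three clusters of 40 points each in 10 dimensions
rng = np.random.default_rng(0)
centers = rng.normal(scale=5.0, size=(3, 10))
X_toy = np.vstack([c + rng.normal(size=(40, 10)) for c in centers])

X_toy_scaled = StandardScaler().fit_transform(X_toy)

# perplexity must be smaller than the number of samples; 30 is scikit-learn's default
tsne_toy = TSNE(n_components=2, perplexity=30, random_state=0)
embedding = tsne_toy.fit_transform(X_toy_scaled)
print(embedding.shape)
```

It is usually worth rerunning the embedding with a few different perplexity values before reading much into the shape or spacing of the clusters.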

The pandas_profiling package can generate an interactive HTML report that presents, for each column, the following statistics where relevant for the column type:

  • Essentials: type, unique values, missing values
  • Quantile statistics like minimum value, Q1, median, Q3, maximum, range, interquartile range
  • Descriptive statistics like mean, mode, standard deviation, sum, median absolute deviation, coefficient of variation, kurtosis, skewness
  • Most frequent values
  • Histogram
  • Correlations: highlighting of highly correlated variables, Spearman and Pearson matrices
In [58]:
# import pandas_profiling
# pandas_profiling.ProfileReport(df)

Thanks, folks!